import numpy as np
import pandas as pd
import dalex as dx
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')
import plotly
plotly.offline.init_notebook_mode()
full_data = pd.read_csv('hotel_bookings.csv')
full_data.head()
| | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | NaN | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
5 rows × 32 columns
full_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype
---  ------                          --------------   -----
 0   hotel                           119390 non-null  object
 1   is_canceled                     119390 non-null  int64
 2   lead_time                       119390 non-null  int64
 3   arrival_date_year               119390 non-null  int64
 4   arrival_date_month              119390 non-null  object
 5   arrival_date_week_number        119390 non-null  int64
 6   arrival_date_day_of_month       119390 non-null  int64
 7   stays_in_weekend_nights         119390 non-null  int64
 8   stays_in_week_nights            119390 non-null  int64
 9   adults                          119390 non-null  int64
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64
 12  meal                            119390 non-null  object
 13  country                         118902 non-null  object
 14  market_segment                  119390 non-null  object
 15  distribution_channel            119390 non-null  object
 16  is_repeated_guest               119390 non-null  int64
 17  previous_cancellations          119390 non-null  int64
 18  previous_bookings_not_canceled  119390 non-null  int64
 19  reserved_room_type              119390 non-null  object
 20  assigned_room_type              119390 non-null  object
 21  booking_changes                 119390 non-null  int64
 22  deposit_type                    119390 non-null  object
 23  agent                           103050 non-null  float64
 24  company                         6797 non-null    float64
 25  days_in_waiting_list            119390 non-null  int64
 26  customer_type                   119390 non-null  object
 27  adr                             119390 non-null  float64
 28  required_car_parking_spaces     119390 non-null  int64
 29  total_of_special_requests       119390 non-null  int64
 30  reservation_status              119390 non-null  object
 31  reservation_status_date         119390 non-null  object
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB
# for simplicity we will use only a subset of the variables
num_features = ["lead_time", "adults", "children", "babies", "booking_changes", "previous_cancellations", "is_repeated_guest"]
cat_features = ["arrival_date_month", "deposit_type", "customer_type"]
# Separate features and predicted value
features = num_features + cat_features
X = full_data.drop(["is_canceled"], axis=1)[features]
y = full_data["is_canceled"]
# Preprocessing for numerical features:
# for most numeric columns (dates excepted), 0 is the most logical fill value,
# and no dates are missing here.
num_transformer = SimpleImputer(strategy="constant")
# Preprocessing for categorical features:
cat_transformer = Pipeline(steps=[
("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
("onehot", OneHotEncoder(handle_unknown='ignore'))])
# Bundle preprocessing for numerical and categorical features:
preprocessor = ColumnTransformer(transformers=[("num", num_transformer, num_features),
("cat", cat_transformer, cat_features)])
rf_model = RandomForestClassifier(n_estimators=160,
max_features=0.4,
min_samples_split=2,
n_jobs=-1,
random_state=0)
model_pipe = Pipeline(steps=[('preprocessor', preprocessor),
('model', rf_model)])
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=123)
model_pipe.fit(X_train, y_train)
preds = model_pipe.predict(X_test)
score = accuracy_score(y_test, preds)
print(f"Accuracy_score: {round(score, 4)}")
Accuracy_score: 0.7732
explainer = dx.Explainer(model_pipe, X_train, y_train)
Preparation of a new explainer is initiated
  -> data              : 83573 rows 10 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 83573 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x000002D6EF26C430> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.0, mean = 0.371, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.97, mean = -0.000545, max = 0.984
  -> model_info        : package sklearn

A new explainer has been created!
idx = 85541
observation = X_test.loc[[idx]]
prediction = model_pipe.predict(observation)
print(f'Predicted value for the selected observation: {prediction[0]}, real value: {y_test.loc[idx]}')
Predicted value for the selected observation: 0, real value: 0
observation.squeeze()
lead_time                          5
adults                             1
children                         0.0
babies                             0
booking_changes                    0
previous_cancellations             0
is_repeated_guest                  1
arrival_date_month             March
deposit_type              No Deposit
customer_type              Transient
Name: 85541, dtype: object
break_down = explainer.predict_parts(observation, type='break_down', order=X_test.columns.to_list())
break_down.plot()
Based on the break-down plot, the greatest negative impact on the prediction comes from lead_time, which equals the number of days between the booking date and the arrival or cancellation date. deposit_type also has a meaningful effect.
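To make the break-down idea concrete, here is a toy sketch (not dalex internals; the model and data below are made up for illustration): features are fixed to the observation's values one at a time in the chosen order, and each contribution is the resulting shift in the average prediction. Starting from the dataset-mean prediction, the contributions sum to the prediction for the explained observation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))          # background dataset, features x0 and x1

def model(X):                           # a simple additive "model"
    return 2.0 * X[:, 0] + 1.0 * X[:, 1]

obs = np.array([1.5, -0.5])             # observation to explain
baseline = model(X).mean()              # average prediction over the data

# Fix x0 first, then x1 (for non-additive models the order matters)
X0 = X.copy(); X0[:, 0] = obs[0]
after_x0 = model(X0).mean()
X01 = X0.copy(); X01[:, 1] = obs[1]
after_x01 = model(X01).mean()

contrib_x0 = after_x0 - baseline        # shift from fixing x0
contrib_x1 = after_x01 - after_x0       # additional shift from fixing x1
print(contrib_x0, contrib_x1)

# Baseline plus contributions reconstructs the prediction for `obs`
assert np.isclose(baseline + contrib_x0 + contrib_x1, model(obs[None, :])[0])
```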
shap = explainer.predict_parts(observation, type='shap', B=5)
shap.plot()
We see that a small lead_time decreases the probability of cancellation, which seems logical: it is easier to plan our immediate future than our plans for the next few months. The fact that someone is a repeated guest also makes cancellation less likely.
Note that on the SHAP plot customer_type has a positive attribution to the model prediction, while on the break-down plot its attribution is very small but still negative. This may suggest that there are interactions between customer_type and other variables.
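Such disagreement is exactly what interactions produce. In a toy model with a pure interaction term (a made-up example, not the hotel model), the break-down contribution of a feature depends on where it appears in the ordering, while the SHAP value averages contributions over all orderings:

```python
import itertools
import numpy as np

# Background data: all four combinations of two binary features
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def model(X):
    return X[:, 0] * X[:, 1]            # pure interaction, no main effects

obs = np.array([1.0, 1.0])              # observation to explain

def breakdown(order):
    """Break-down contributions for a given feature ordering."""
    contribs = {}
    Xc = X.copy()
    prev = model(Xc).mean()
    for j in order:
        Xc = Xc.copy(); Xc[:, j] = obs[j]
        cur = model(Xc).mean()
        contribs[j] = cur - prev
        prev = cur
    return contribs

print(breakdown((0, 1)))                # x1, fixed last, gets more credit
print(breakdown((1, 0)))                # now x0 gets more credit instead

# SHAP value of x0 = its contribution averaged over all orderings
shap0 = np.mean([breakdown(o)[0] for o in itertools.permutations([0, 1])])
print(shap0)
```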
idx2 = 11159
observation2 = X_test.loc[[idx2]]
observation2.squeeze()
lead_time                        259
adults                             2
children                         0.0
babies                             0
booking_changes                    0
previous_cancellations             0
is_repeated_guest                  0
arrival_date_month             April
deposit_type              No Deposit
customer_type              Transient
Name: 11159, dtype: object
shap = explainer.predict_parts(observation2, type='shap', B=5)
shap.plot()
We see that this time lead_time, which is much larger than in the previous example, has the greatest positive contribution to the model prediction. Once again, this seems intuitive because, as mentioned earlier, a lot can happen in such a long time, e.g. someone can lose their job or a war can break out.